Matching Titles with Cross Title Web-Search Enrichment and Community Detection

Authors

Nikhil Londhe

Vishrawas Gopalakrishnan

Aidong Zhang

Hung Q. Ngo

Rohini K. Srihari

Abstract

Title matching refers roughly to the following problem. We are given two strings of text obtained from different data sources. The texts refer to some underlying physical entities and the problem is to report whether the two strings refer to the same physical entity or not. There are manifestations of this problem in a variety of domains, such as product or bibliography matching, and location or person disambiguation. We propose a new approach to solving this problem, consisting of two main components. The first component uses Web searches to “enrich” the given pair of titles: making titles that refer to the same physical entity more similar, and those which do not, much less similar. A notion of similarity is then measured using the second component, where the tokens from the two titles are modelled as vertices of a “social” network graph. A “strength of ties” style of clustering algorithm is then applied on this to see whether they form one cohesive “community” (matching titles), or separately clustered communities (mismatching titles). Experimental results confirm the effectiveness of our approach over existing title matching methods across several input domains.
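The two-component idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify how enrichment or clustering is done, so the code below makes several assumptions. The `snippets` argument is a hypothetical stand-in for text returned by Web searches (no search is performed), edge weights are plain co-occurrence counts, and the "strength of ties" clustering is replaced by a much simpler technique: dropping edges below a threshold (`min_weight`, an assumed parameter) and taking connected components as communities. All function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def build_token_graph(title_a, title_b, snippets):
    """Build a weighted co-occurrence graph over tokens.

    Vertices are tokens from the two titles and from the enrichment
    snippets (assumed to come from Web searches). An edge weight counts
    how many texts contain both tokens.
    """
    weights = defaultdict(int)
    for text in [title_a, title_b, *snippets]:
        # Sorting makes each unordered pair map to a single dictionary key.
        for u, v in combinations(sorted(set(tokenize(text))), 2):
            weights[(u, v)] += 1
    return weights


def communities(weights, min_weight):
    """Drop weak ties, then return connected components as communities.

    Assumption: thresholding plus connected components is a simplified
    substitute for the clustering step named in the abstract.
    """
    adjacency = defaultdict(set)
    for (u, v), w in weights.items():
        if w >= min_weight:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(component)
    return result


def titles_match(title_a, title_b, snippets, min_weight=2):
    """Report a match when one community holds every token of both titles."""
    weights = build_token_graph(title_a, title_b, snippets)
    needed = set(tokenize(title_a)) | set(tokenize(title_b))
    return any(needed <= c for c in communities(weights, min_weight))
```

With toy inputs, `titles_match("canon eos 5d", "eos 5d camera", ["canon eos 5d camera", "canon eos 5d camera body"])` groups all title tokens into one community, while two unrelated product titles whose snippets share no tokens end up in separate communities. Requiring every title token to fall in a single community is a strict design choice made here for simplicity; a softer overlap score would be an equally reasonable reading of the abstract.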


Similar resources

Using Web Page Titles to Rediscover Lost Web Pages

Titles are denoted by the TITLE element within a web page. We queried the title against the Yahoo search engine to determine the page’s status (found, not found). We conducted several tests based on elements of the title. These tests were used to discern whether we could predict a page’s status based on the title. Our results increase our ability to determine bad titles but not our ability t...


Syntactic Structures and Rhetorical Functions of Electrical Engineering, Psychiatry, and Linguistics Research Article Titles in English and Persian: A Cross-linguistic and Cross-disciplinary Study

A research article (RA) title is the first and foremost feature that attracts the reader's attention, the feature from which she/he may decide whether the whole article is worth reading. The present study attempted to investigate syntactic structures and rhetorical functions of RA titles written in English and Persian and published in journals in three disciplines of Electrical Engineering, Psy...


A Hybrid Model Words-Driven Approach for Web Product Duplicate Detection

The detection of product duplicates is one of the challenges that Web shop aggregators are currently facing. In this paper, we focus on solving the problem of product duplicate detection on the Web. Our proposed method extends a state-of-the-art solution that uses the model words in product titles to find duplicate products. First, we employ the aforementioned algorithm in order to find matchin...


Effectiveness of Title-search vs. Full-text Search in the Web

Search engines sometimes apply the search on the full text of documents or web-pages; but sometimes they can apply the search on selected parts of the documents only, e.g. their titles. Full-text search may consume a lot of computing resources and time. It may be possible to save resources by applying the search on the titles of documents only, assuming that a title of a document provides a con...


Syntactic Structures in Research Article Titles from Three Different Disciplines: Applied Linguistics, Civil Engineering, and Dentistry

Deducing what a paper is about, titles are considered as the most important determinant of how many people will read the article. Therefore, studying the use of different syntactic structures and their rhetorical functions in titles is of great significance. The current study was set to investigate these structures used in research article titles in three disciplines of Applied Linguistics, Den...



Journal:

PVLDB

Volume 7, Issue

Pages -

Publication date: 2014
